lm(), geom_smooth(), geom_ribbon(), plot.lm(), coef(), anova(), broom::tidy(), predict(), ggPredict(), expand.grid(), expand(), broom::augment(), AIC(), step(), vif()
Linear regression is one of the most commonly-used and easy-to-understand approaches to modeling. Linear regression involves a numerical outcome variable y and explanatory variable(s) x that are either numerical or categorical. We will start with simple (univariate) linear regression where y is modeled as a function of a single x variable.
The mathematical equation for simple regression is as follows: \[y = \beta_1 + \beta_2*x + \varepsilon\] where \(\beta_1\) is the y-intercept, \(\beta_2\) is the slope and \(\varepsilon\) is the error term, or the portion of y that the regression model can’t explain.
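To build some intuition for this equation, here is a minimal sketch (with made-up data, not the penguins) that simulates y from known coefficients and checks that lm() recovers them:

```r
set.seed(42)
x = runif(100, min = 0, max = 10)        # explanatory variable
y = 2 + 0.5 * x + rnorm(100, sd = 1)     # true intercept = 2, true slope = 0.5, plus random error
lm_sim = lm(y ~ x)                       # fit the simple regression
coef(lm_sim)  # estimates should land close to 2 (intercept) and 0.5 (slope)
```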
Linear regression has 5 key assumptions:
Linearity: the relationship between x and y is linear
Normality: the residuals are normally distributed
No or little multicollinearity among the explanatory variables
Independence: the observations (and their errors) are independent of each other
Homoscedasticity: the residuals have constant variance across values of x
Additionally, a good sample size rule of thumb is that the regression analysis requires at least 20 cases per independent variable in the analysis.
Here are some plots you want to look at before building a linear regression model: for example, a scatter plot matrix (GGally::ggpairs()) to check for multicollinearity in your explanatory variables.
If you do encounter some problems with your data, there are many solutions that can help make linear regression an appropriate analysis. For example, if your explanatory variables aren’t normal or you have heteroscedasticity, a nonlinear transformation (such as \(log(x)\), \(x^2\) or \(\sqrt{x}\)) may solve the issue. Heteroscedasticity can also be dealt with by calculating robust standard errors for your linear model coefficients. If you have some nasty outliers, think about whether you might be (scientifically) justified in removing them from the dataset. If several of your explanatory variables are correlated, you can consider removing some of them using stepwise regression. These topics will be covered in detail in a statistics class.
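As a quick illustration of the transformation idea, here is a hedged sketch with simulated heteroscedastic data: the error is multiplicative on the raw scale, so modeling log(y) restores roughly constant variance:

```r
set.seed(1)
x = runif(200, min = 1, max = 10)
y = exp(0.2 + 0.3 * x + rnorm(200, sd = 0.2))  # multiplicative error: spread grows with x
lm_raw = lm(y ~ x)        # residuals fan out as fitted values grow
lm_log = lm(log(y) ~ x)   # on the log scale the error term is roughly constant
```

Plotting the residuals of lm_raw against those of lm_log shows the funnel shape disappearing after the transformation.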
Let’s use the palmerpenguins data to look again at the relationship between bill length and bill depth. I’ll do a little exploratory data analysis by viewing the first few data points and calculating summary statistics, then I’ll look at the density plots and correlations with GGally::ggpairs().
library(tidyverse)
library(palmerpenguins)
library(GGally) # ggpairs()
library(ggiraph)
library(ggiraphExtra) # ggPredict()
library(broom) # tidy() augment()
library(car) # vif()
# Exploratory data analysis:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
summary(penguins) # summary statistics for each column
# visual correlation matrix for all continuous variables
penguins %>%
select(species, body_mass_g, ends_with("_mm")) %>% # select variables with names that end in "_mm"
GGally::ggpairs(aes(color = species))
# visual correlation matrix for bill depth and bill length
penguins %>%
select(bill_depth_mm, bill_length_mm) %>%
GGally::ggpairs() # calling out the library can avoid ambiguity for common-named functions, or just serve as a reminder to you
The linearity of the relationship between the explanatory variable bill_length_mm and the outcome variable bill_depth_mm doesn’t look very promising to me. Let’s see what happens if we actually build the linear regression. We’ll use the function lm(), which stands for “linear model”, and we’ll indicate the relationship we want to test by putting a formula of the format y~x as a parameter. We’ll save the model results in a variable, then use the summary() function to display the model results:
lm_1 = lm(bill_depth_mm ~ bill_length_mm, data=penguins)
summary(lm_1)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1381 -1.4263 0.0164 1.3841 4.5255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.88547 0.84388 24.749 < 2e-16 ***
## bill_length_mm -0.08502 0.01907 -4.459 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.922 on 340 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.05525, Adjusted R-squared: 0.05247
## F-statistic: 19.88 on 1 and 340 DF, p-value: 1.12e-05
Both coefficients \(\beta_1\) (the y-intercept) and \(\beta_2\) (the slope associated with bill length) are statistically significant with \(p<0.05\). However, the R-squared = 0.05, which means that bill length only explains about 5% of the variance in the bill depth observations. That’s pretty pathetic. Since this is just a simple univariate regression model we can quickly plot the data and the linear model using geom_smooth() with method="lm" in ggplot.
ggplot(data=penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
There are clear clusters in the data that are being ignored by our regression model, and the line doesn’t seem to capture any interesting trend. The negative coefficient for bill_length_mm (as well as the plotted model) indicates that as bill length increases, bill depth decreases. That doesn’t really make sense, right? What did we do wrong?
Well, grouping all of the penguin data, i.e. all three species, is pretty illogical. In fact, it violates another, less cited, assumption of linear regression: “All necessary independent variables are included in the regression that are specified by existing theory and/or research.” We probably should have included species, or separated our analysis out into three separate models, one for each species.
To check some of the assumptions of your linear model post hoc, you can send your saved model to the plot() function in base R:
class(lm_1) # Note that your model output is a variable of the class "lm"
## [1] "lm"
plot(lm_1) # This actually calls plot.lm() since the first parameter is class "lm"
You can learn more about this quick-and-dirty diagnostics plotting trick by looking up the help page ?plot.lm. The function plot.lm() is the version of the plot() function that is called when the parameter that you pass plot() is of the class “lm”. This output provides you with four useful plots: Residuals vs. Fitted (to check the linearity assumption), Normal Q-Q (to check the normality of the residuals), Scale-Location (to check homoscedasticity) and Residuals vs. Leverage (to identify influential outliers).
There are many objective statistical tests that can be performed to check the assumptions of your data. For more resources, look at the end of this tutorial page.
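Each of the four diagnostic panels can also be requested on its own through the which parameter of plot.lm(). A self-contained sketch using R’s built-in mtcars dataset (not our penguin model):

```r
# Fit a small example model on R's built-in mtcars data
lm_cars = lm(mpg ~ wt, data = mtcars)

par(mfrow = c(2, 2))   # arrange all four default diagnostic plots in a 2x2 grid
plot(lm_cars)          # calls plot.lm() since lm_cars is of class "lm"
par(mfrow = c(1, 1))

plot(lm_cars, which = 2)  # draw just the second panel, the normal Q-Q plot
```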
Let’s do the logical thing and test the same relationship bill_depth_mm ~ bill_length_mm but only looking at one species:
gentoo = penguins %>% filter(species=="Gentoo")
# visual correlation matrix for bill depth and bill length
gentoo %>%
select(bill_depth_mm, bill_length_mm) %>%
GGally::ggpairs()
## Warning: Removed 1 rows containing non-finite values (stat_density).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_density).
lm_2 = lm(bill_depth_mm ~ bill_length_mm, data=gentoo)
summary(lm_2)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = gentoo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55952 -0.52572 -0.06658 0.46041 2.95390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.25101 1.05481 4.978 2.15e-06 ***
## bill_length_mm 0.20484 0.02216 9.245 1.02e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7543 on 121 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4139, Adjusted R-squared: 0.4091
## F-statistic: 85.46 on 1 and 121 DF, p-value: 1.016e-15
ggplot(data=gentoo, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
This trend looks a little better to me. The R-squared = 0.41, so the model explains about 41% of the variation. That’s pretty impressive considering we only have one variable in there. Also, the model passes the sanity check because it seems logical that bill depth will increase with increasing bill length.
We can actually use geom_smooth() to examine separate linear models for each of the three penguin species at once without formally running the regression:
ggplot(data=penguins) +
geom_point(aes(x=bill_length_mm, y=bill_depth_mm, color=species)) +
geom_smooth(aes(x=bill_length_mm, y=bill_depth_mm, color=species), method = "lm") +
geom_smooth(data=penguins, aes(x=bill_length_mm, y=bill_depth_mm), method = "lm", color="black")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
So that makes it very clear that running all three species together was a serious logical error, and it produced a completely untrue result that bill depth increases as bill length decreases. In fact, when we split the three species apart, we can see that bill depth increases with increasing bill length (as you’d expect) and the relationships look pretty similar for each species. This is an example of Simpson’s paradox, which occurs when trends that appear when a dataset is separated into groups reverse when the data are aggregated. Basically, when doing statistics, you want to use your brain and your gut to build models that make sense.
Build a linear model predicting Gentoo bill length as a function of flipper length. Plot the predictions. Which explanatory variable (bill length vs. flipper length) does a better job of predicting bill depth? What is your evidence?
In life, we typically can’t do a very good job explaining variation in some dependent variable using just a single independent variable. Multiple linear regression is when we build a function with multiple independent variables of the form:
\[y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 + ... + \varepsilon\]
When conducting multiple linear regression, all of the same assumptions and diagnostics apply, with one important addition: the interpretation of the associated effect of any one explanatory variable must be made in conjunction with the other explanatory variables included in your model.
The statistical decisions you make should account for your end goals. There are two broad types of goals when building a model: modeling for explanation and modeling for prediction.
In this lesson, and in general when you are trying to “play it safe” in your analyses, you will be modeling for explanation. This means if you are modeling some \(y\) as a function of \(x_1\) and \(x_2\), but \(x_1\) and \(x_2\) are quite collinear, then you won’t be able to differentiate the unique impact of either \(x\) variable on \(y\) because, for example, \(x_1\) may be stealing some of the variation from \(x_2\). Then the total impact of \(x_2\) on \(y\) will be unfairly biased small, and the impact of \(x_1\) will be unfairly biased large. Who knows what chaos will ensue when you use these biased models to guide science, policy, etc.?
However, if you are modeling for prediction, and don’t actually care what the relative impact of \(x_1\) or \(x_2\) is on \(y\), but you want your \(y\) predictions to be as accurate and precise as possible, then you don’t have to worry about multicollinearity. For example, weather predictions are like this. Meteorologists just throw everything they can into their models, even though their “explanatory variables” can be highly correlated, because they aren’t trying to demonstrate a relationship between rainfall and pressure; they are just trying to let you know whether it’ll be a good weekend for a beach trip.
Explanatory variables can be continuous, like bill_depth_mm and bill_length_mm, or categorical, like species. Categorical variables should typically be in the factor class in R. There are cases when you, the data scientist, can make a conscientious choice between assigning a variable as continuous vs. categorical. A good example is “year”. If you are looking for a trend in your data over time, year should be treated as a continuous variable. If you are trying to account for some wacky conditions that can change from year to year, but that probably don’t constitute a temporal trend, you could treat year as a categorical variable.
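The practical difference shows up in how many coefficients the model estimates. A small sketch with made-up data (the variable names here are hypothetical):

```r
set.seed(7)
d = data.frame(year = rep(2007:2009, each = 20))
d$y = rnorm(60, mean = d$year - 2000)        # fake outcome with a yearly trend

length(coef(lm(y ~ year, data = d)))          # continuous: intercept + one slope = 2 coefficients
length(coef(lm(y ~ factor(year), data = d)))  # categorical: intercept + one offset per extra year = 3
```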
Our adventures in simple regression taught us that we shouldn’t model bill depth as a function of bill length without accounting for species. We built three separate models, one for each species. Another option is to build one model, but include species as an additional explanatory variable.
This is very simple to implement in the lm() function. We’ll try a function of the form bill_depth_mm ~ bill_length_mm + species. This model will estimate a single slope associated with bill length and a different intercept for each species. The resulting prediction will look like three parallel lines, one for each species.
# Drop NA data before fitting model. This helps me avoid problems down the line with predict()
penguins_lm_3 = penguins %>%
filter(!is.na(bill_depth_mm),
!is.na(bill_length_mm),
!is.na(species))
lm_3 = lm(bill_depth_mm ~ bill_length_mm + species, data=penguins_lm_3)
There are several different ways to access your model results. You can copy and paste coefficient estimates, p-values, etc. from the summary() output, or you can extract various elements from the model variable using specialized functions like coef(). If you want your results to look like an ANOVA table (remember, ANOVA and linear regression are mathematically equivalent), you can use the anova() function. Probably the most efficient way to get at your model results makes use of the tidy() function in the broom package. Note that although broom is part of the tidyverse, it is not loaded by library(tidyverse); it must be loaded separately with library(broom). tidy() outputs the model results in a neat data frame, which you can easily export to a .csv file to save or use to create a table in your manuscript. Check the help page ?tidy.lm to see that you can also ask for confidence intervals on your coefficients.
summary(lm_3)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
coef(lm_3) # vector of coefficients
## (Intercept) bill_length_mm speciesChinstrap speciesGentoo
## 10.5921805 0.1998943 -1.9331943 -5.1060201
coef(lm_3)[1] # subset the y-intercept
## (Intercept)
## 10.59218
anova(lm_3) # Analysis of variance tables (typically used when all predictors are categorical)
## Analysis of Variance Table
##
## Response: bill_depth_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_length_mm 1 73.47 73.47 80.84 < 2.2e-16 ***
## species 2 949.16 474.58 522.17 < 2.2e-16 ***
## Residuals 338 307.20 0.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::tidy(lm_3)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.683 15.5 2.43e-41
## 2 bill_length_mm 0.200 0.0175 11.4 8.66e-26
## 3 speciesChinstrap -1.93 0.224 -8.62 2.55e-16
## 4 speciesGentoo -5.11 0.191 -26.7 3.65e-85
broom::tidy(lm_3, conf.int = TRUE, conf.level = 0.95) # Added confidence intervals to output
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.683 15.5 2.43e-41 9.25 11.9
## 2 bill_length_mm 0.200 0.0175 11.4 8.66e-26 0.165 0.234
## 3 speciesChinstrap -1.93 0.224 -8.62 2.55e-16 -2.37 -1.49
## 4 speciesGentoo -5.11 0.191 -26.7 3.65e-85 -5.48 -4.73
broom::tidy(lm_3, conf.int = TRUE, conf.level = 0.95) %>%
mutate_if(is.numeric, round, 2) # round to 2 decimals
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.6 0.68 15.5 0 9.25 11.9
## 2 bill_length_mm 0.2 0.02 11.4 0 0.17 0.23
## 3 speciesChinstrap -1.93 0.22 -8.62 0 -2.37 -1.49
## 4 speciesGentoo -5.11 0.19 -26.7 0 -5.48 -4.73
Now let’s plot the data, the linear model predictions and the standard errors on those predictions. Although it seems super similar, recall that the plot we made at the end of our simple linear regression section actually had the results of three distinct linear models plotted onto the same figure: one for each species. This multiple regression model lm_3 that we created is a single model with two explanatory variables, bill_length_mm and species. The slopes associated with each species must be equivalent in this model formulation.
The combined libraries ggiraph and ggiraphExtra have a really convenient plot function for examining a model called ggPredict(). You can set the parameters so that standard errors are printed around the model and you can also make the plot interactive. That means if you click on a point or a line, a box will pop up with information on that datapoint:
library(ggiraph)
library(ggiraphExtra)
ggPredict(lm_3, se=TRUE, interactive=TRUE)
While ggPredict() is a fabulous function, there is not a lot of room to customize it. The more formal and more customizable method for accessing your model predictions is using the predict() function in base R.
lm_3_predictions = predict(lm_3, interval="confidence") # Calculates lm predictions for the original dataset
head(lm_3_predictions)
## fit lwr upr
## 1 18.40805 18.25507 18.56102
## 2 18.48800 18.33346 18.64254
## 3 18.64792 18.48673 18.80911
## 4 17.92830 17.75958 18.09702
## 5 18.44803 18.29442 18.60163
## 6 18.36807 18.21542 18.52072
head(penguins_lm_3)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… 36.7 19.3 193 3450 fema…
## 5 Adelie Torge… 39.3 20.6 190 3650 male
## 6 Adelie Torge… 38.9 17.8 181 3625 fema…
## # … with 1 more variable: year <int>
penguins_lm_3_predict = cbind(penguins_lm_3, lm_3_predictions)
head(penguins_lm_3_predict)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## sex year fit lwr upr
## 1 male 2007 18.40805 18.25507 18.56102
## 2 female 2007 18.48800 18.33346 18.64254
## 3 female 2007 18.64792 18.48673 18.80911
## 4 female 2007 17.92830 17.75958 18.09702
## 5 male 2007 18.44803 18.29442 18.60163
## 6 female 2007 18.36807 18.21542 18.52072
ggplot(penguins_lm_3_predict, aes(x = bill_length_mm, y = bill_depth_mm, color = species) ) +
geom_point() +
geom_ribbon( aes(ymin = lwr, ymax = upr, fill = species, color = NULL), alpha = .1) +
geom_line(aes(y = fit), size = 1) +
theme_bw()
The fitted lines in all the plots so far are different lengths. This is because we have slightly different ranges of bill length data for each species category in the dataset. By default when using predict() we get the fitted values; i.e., the predicted values from the dataset used in model fitting.
I think having different line lengths is fine here, but there are times when we want to draw each line across the entire range of the variable in the dataset. Also, sometimes our data are so sparse that our fitted line ends up not being very smooth; this can be especially problematic for non-linear fits. In both of these situations we’d want to make a new dataset for making the predictions.
Let’s make model prediction lines using the entire range of bill length instead of the within-species range. We can make a variable with the full range of bill length via seq(), making a sequence from the minimum to maximum dataset value. I use 0.1 as the increment in seq(); the increment value you’ll want to use depends on the range of your variable. Then to get the full range of bill length associated with each species category we can use expand.grid() in base R:
# Build a new bill_length_mm dataset that spans the full range of the original data at even intervals
newdata_bill_length_mm = seq(min(penguins_lm_3$bill_length_mm), max(penguins_lm_3$bill_length_mm), by = .1)
# Repeat complete bill_length_mm data for each species
newdata = expand.grid(bill_length_mm = newdata_bill_length_mm, species = unique(penguins_lm_3$species) )
head(newdata)
## bill_length_mm species
## 1 32.1 Adelie
## 2 32.2 Adelie
## 3 32.3 Adelie
## 4 32.4 Adelie
## 5 32.5 Adelie
## 6 32.6 Adelie
The key to making a dataset for prediction is that it must have every variable used in the model in it. You will get an error if you forget a variable or make a typo in one of the variable names. Note that the prediction dataset does not need to contain the response variable.
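Here is a self-contained sketch of that error, using a model built on R’s built-in mtcars data rather than the penguins:

```r
lm_cyl = lm(mpg ~ wt + factor(cyl), data = mtcars)

ok = predict(lm_cyl, newdata = data.frame(wt = 3, cyl = 6))  # works: both model variables present
bad = tryCatch(
  predict(lm_cyl, newdata = data.frame(wt = 3)),  # fails: 'cyl' is missing from newdata
  error = function(e) conditionMessage(e)
)
bad  # an error message saying that 'cyl' cannot be found
```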
We use this prediction dataset with the newdata argument in predict(). We can add the predicted values to the prediction dataset using cbind(). When we make the plot of the fitted lines now, we can see that the line for each species covers the same range. There are now two datasets used in the plotting code: the original for the points and newdata for the predicted line and 95% confidence intervals:
newdata_predict_lm_3 = cbind(newdata, predict(lm_3, interval="confidence", newdata = newdata))
dim(newdata_predict_lm_3)
## [1] 828 5
ggplot() +
geom_point(data=penguins_lm_3_predict, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
geom_ribbon(aes(ymin=lwr, ymax=upr, x = bill_length_mm, fill = species, color = NULL), alpha = .1, data=newdata_predict_lm_3) +
geom_line(aes(y = fit, x = bill_length_mm, color=species), size = 1, data=newdata_predict_lm_3) +
theme_bw()
Another way to generate model predictions is using the augment() function in the broom package. This function fits into the dplyr coding flow, so you call it with a pipe and the model estimate columns automatically append to the data that you are using. This avoids the extra call to cbind(), and perhaps a mistake in joining data frames.
# Get model predictions
lm_3_predict = lm_3 %>%
augment(penguins_lm_3, se_fit=TRUE) %>%
mutate(lwr = .fitted - 1.96 * .se.fit, upr = .fitted + 1.96 * .se.fit) # Calculate 95% C.I. using SE
# Plot the data and the model predictions
ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=lm_3_predict) +
geom_point() +
geom_ribbon(aes(ymin = lwr, ymax = upr, fill = species, color = NULL), alpha = .15) +
geom_line( aes(y = .fitted), size = 1)
Similarly, we can use augment() to generate predictions with new data, so that our predicted model plots extend beyond the range of the data used to fit the model. To stay within the tidyverse, we’ll use tidyr::expand() in place of expand.grid() to generate all possible combinations of the explanatory variables used in our model.
# Get model predictions with newdata
newdata = penguins_lm_3 %>% tidyr::expand(bill_length_mm, species)
lm_3_predict = lm_3 %>%
broom::augment(newdata = newdata, se_fit = TRUE) %>%
mutate(lwr = .fitted - 1.96 * .se.fit, upr = .fitted + 1.96 * .se.fit) # Calculate 95% C.I. using SE
# Plot the data and the model predictions
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=penguins_lm_3) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = bill_length_mm, fill = species, color = NULL), alpha = .15, data=lm_3_predict) +
geom_line(data=lm_3_predict, aes(y = .fitted, x = bill_length_mm, color=species), size = 1)
Why am I showing you three completely different ways to generate the same predictions and the same plot? Well, that’s the secret of programming. There is always more than one way to get something done. By trying out the different methods to accomplish the same task, you’ll become a more versatile and efficient programmer. For your own linear models, use the prediction method that makes the most sense for you. However, one benefit of the tidyverse method is that if you wind up running statistical analyses on Big Data and need to build many similar models, these methods combined with tools in the tidyverse purrr package make this very easy and efficient to code.
I’m proud of that last plot so I’m going to make it a bit prettier and save it:
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=penguins_lm_3) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = bill_length_mm, fill = species, color = NULL), alpha = .15, data=lm_3_predict) +
geom_line(data=lm_3_predict, aes(y = .fitted, x = bill_length_mm, color=species), size = 1) +
theme_bw() +
xlab("Bill length (mm)") + ylab("Bill depth (mm)")
ggsave(filename = "figures/bill_depth_model.png", device = "png", width = 5, height = 3, units = "in", dpi = 300)
Recall our model lm_3:
lm_3 = lm(bill_depth_mm ~ bill_length_mm + species, data=penguins_lm_3)
We allow bill depth to depend on both the continuous variable bill length and the categorical variable species; however, the species independent variable only has the power to change the y-intercept of the predictions for each of the three species. In this formulation, the prediction lines associated with each species are parallel. But what if this isn’t a good formulation for predicting bill depth? Perhaps, for example, Chinstrap bill depth increases more slowly with increasing bill length than Gentoo bill depth. To explore the possibility that the relationship between bill depth and bill length changes for each of the three species, we can add an interaction term. If we model the interaction between bill length and species, the bill depth prediction lines for each species are no longer forced to be parallel. Variable interactions are indicated with the * sign in R model formulas:
lm_4 = lm(bill_depth_mm ~ bill_length_mm*species, data=penguins_lm_3)
Note that bill_length_mm*species is shorthand for bill_length_mm + species + bill_length_mm:species. Interaction terms are coded this way because when you build a model with an interaction, it is important to always include the interacting variables on their own as well. You wouldn’t want a model that looked like bill_depth_mm ~ bill_length_mm:species without including bill length and species as additional terms, because then the interaction coefficients would be tasked with accounting for all of the variation from each of those two independent variables in addition to the interaction between those variables. This would make interpretation difficult and may bias our understanding of the importance of the interaction.
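You can confirm this expansion by asking R for the terms in each formula (y, x and g here are placeholder names, not penguin variables):

```r
# The * operator expands into both main effects plus the : interaction
attr(terms(y ~ x * g), "term.labels")
## [1] "x"   "g"   "x:g"

# The : operator alone produces only the interaction term
attr(terms(y ~ x:g), "term.labels")
## [1] "x:g"
```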
If we look at the model results for lm_3 side by side with lm_4, we can see that the inclusion of the interaction term is not helping our model very much. The \(R^2\) measure barely improves with lm_4, and the \(Adjusted-R^2\) measure is actually lower. None of the interaction coefficients in lm_4 are statistically significant.
summary(lm_3)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4529 -0.6864 -0.0508 0.5519 3.5915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59218 0.68302 15.508 < 2e-16 ***
## bill_length_mm 0.19989 0.01749 11.427 < 2e-16 ***
## speciesChinstrap -1.93319 0.22416 -8.624 2.55e-16 ***
## speciesGentoo -5.10602 0.19142 -26.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9533 on 338 degrees of freedom
## Multiple R-squared: 0.769, Adjusted R-squared: 0.7669
## F-statistic: 375.1 on 3 and 338 DF, p-value: < 2.2e-16
summary(lm_4)
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm * species, data = penguins_lm_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6574 -0.6675 -0.0524 0.5383 3.5032
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.40912 1.13812 10.025 < 2e-16 ***
## bill_length_mm 0.17883 0.02927 6.110 2.76e-09 ***
## speciesChinstrap -3.83998 2.05398 -1.870 0.062419 .
## speciesGentoo -6.15812 1.75451 -3.510 0.000509 ***
## bill_length_mm:speciesChinstrap 0.04338 0.04558 0.952 0.341895
## bill_length_mm:speciesGentoo 0.02601 0.04054 0.642 0.521590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9548 on 336 degrees of freedom
## Multiple R-squared: 0.7697, Adjusted R-squared: 0.7662
## F-statistic: 224.5 on 5 and 336 DF, p-value: < 2.2e-16
If we use the AIC() function to compare the Akaike Information Criterion between the 2 models, the AIC for lm_4 is higher, meaning the fit is poorer compared to lm_3. Another option for comparing a complex model with a simpler model is to use the step() function. You can input the most complicated model formulation that you are interested in, and this function will reduce the model’s complexity one step at a time and check the fit at each step. Read the documentation on step() to learn more about how decisions are made.
AIC(lm_3, lm_4)
## df AIC
## lm_3 5 943.8504
## lm_4 7 946.8778
best_model = step(lm_4)
## Start: AIC=-25.68
## bill_depth_mm ~ bill_length_mm * species
##
## Df Sum of Sq RSS AIC
## - bill_length_mm:species 2 0.87243 307.20 -28.704
## <none> 306.32 -25.676
##
## Step: AIC=-28.7
## bill_depth_mm ~ bill_length_mm + species
##
## Df Sum of Sq RSS AIC
## <none> 307.20 -28.70
## - bill_length_mm 1 118.67 425.87 81.01
## - species 2 949.16 1256.36 449.00
best_model
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + species, data = penguins_lm_3)
##
## Coefficients:
## (Intercept) bill_length_mm speciesChinstrap speciesGentoo
## 10.5922 0.1999 -1.9332 -5.1060
If we plotted the lm_4 predictions for each of the 3 species, the lines would probably look pretty close to parallel, since the interaction terms were not significant. We can build our predictions with the same methods we used for lm_3.
# Plotting predictions with ggPredict():
# ggPredict(lm_4, se=TRUE, interactive=TRUE) # easy, but less customizable
# Get model predictions with newdata
newdata = penguins_lm_3 %>% expand(bill_length_mm, species)
head(newdata)
## # A tibble: 6 x 2
## bill_length_mm species
## <dbl> <fct>
## 1 32.1 Adelie
## 2 32.1 Chinstrap
## 3 32.1 Gentoo
## 4 33.1 Adelie
## 5 33.1 Chinstrap
## 6 33.1 Gentoo
lm_4_predict = lm_4 %>%
augment(newdata = newdata, se_fit = TRUE) %>%
mutate(lwr = .fitted - 1.96 * .se.fit, upr = .fitted + 1.96 * .se.fit) # Calculate 95% C.I. using SE
# Plot the data and the model predictions
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species), data=penguins_lm_3) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = bill_length_mm, fill = species, color = NULL), alpha = .15, data=lm_4_predict) +
geom_line(data=lm_4_predict, aes(y = .fitted, x = bill_length_mm, color=species), size = 1)
As we expected, the fitted lines of bill depth vs. bill length for the 3 species are almost parallel. Since the AIC value is lower (i.e. better fit) for lm_3, we would select that as our final model instead of lm_4.
When including multiple independent variables in a regression, especially multiple continuous variables, you need to check the collinearity of your predictors. If 2 predictors x1 and x2 are extremely collinear, your interpretation of the model can be biased because you can't objectively determine how much of the variation in y is driven by x1 vs. x2. To check for this, calculate the Variance Inflation Factor (VIF) of your independent variables. If the VIF = 1, the variable is not correlated with the other predictors. Different thresholds are used, but generally a VIF > 4 or 5 indicates a moderate multicollinearity problem and a VIF > 10 indicates a serious one. This statistic should be reported, and model results should be interpreted with the VIF in mind. If you are more interested in a mechanistic model (i.e. figuring out WHAT drives y) than a predictive model (i.e. getting the BEST estimate of y), then you should use your best judgment to remove predictor variables until the model results are more interpretable.
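For intuition, the VIF of predictor j in a linear model is just \(1 / (1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor j on all of the other predictors. Here is a minimal sketch of that calculation for the Gentoo bill length predictor (the variable name gentoo_df is just for this example); you can compare the result against what car::vif() reports for the same model:

```r
library(dplyr)
library(palmerpenguins)

gentoo_df = penguins %>% filter(species == "Gentoo")

# Auxiliary regression: one predictor (bill length) on the other predictors
aux_fit = lm(bill_length_mm ~ flipper_length_mm + body_mass_g, data = gentoo_df)
r2 = summary(aux_fit)$r.squared

# VIF = 1 / (1 - R^2); compare to the bill_length_mm entry of car::vif()
1 / (1 - r2)
```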
Let’s try out a regression with 2 continuous variables. We’ll look at Gentoo penguin bill depth as a function of bill length, flipper length and body mass. We’ll use AIC() and step() to compare this complex model with the simpler models. We’ll also use vif() to check if we have a problem with multicollinearity.
library(car) # vif()
library(ggiraph)
library(ggiraphExtra) # ggPredict()
gentoo = penguins %>%
filter(species=="Gentoo")
# Build simple linear regression
lm_gentoo_1 = lm(bill_depth_mm ~ bill_length_mm, data=gentoo)
# Build multiregression with 2 variables
lm_gentoo_2 = lm(bill_depth_mm ~ bill_length_mm + flipper_length_mm, data=gentoo)
# Build multiregression with 3 variables
lm_gentoo_3 = lm(bill_depth_mm ~ bill_length_mm + flipper_length_mm + body_mass_g, data=gentoo)
vif(lm_gentoo_3) # vif values ~ 2, mild multicollinearity
## bill_length_mm flipper_length_mm body_mass_g
## 2.082527 2.271573 2.315374
step(lm_gentoo_3) # Doesn't remove any variables
## Start: AIC=-114.31
## bill_depth_mm ~ bill_length_mm + flipper_length_mm + body_mass_g
##
## Df Sum of Sq RSS AIC
## <none> 45.503 -114.313
## - bill_length_mm 1 1.8206 47.323 -111.487
## - flipper_length_mm 1 5.6252 51.128 -101.976
## - body_mass_g 1 6.8367 52.339 -99.096
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm + flipper_length_mm +
## body_mass_g, data = gentoo)
##
## Coefficients:
## (Intercept) bill_length_mm flipper_length_mm body_mass_g
## -2.2009521 0.0572019 0.0499049 0.0007145
AIC(lm_gentoo_1, lm_gentoo_2, lm_gentoo_3) # lm_gentoo_3 performs best
## df AIC
## lm_gentoo_1 3 283.6673
## lm_gentoo_2 4 251.9633
## lm_gentoo_3 5 236.7460
Now let's plot the model predictions. This is where things get a bit tricky, because we are interested in changes along four dimensions (bill depth, bill length, flipper length and body mass) but we can only easily visualize 2 dimensions in our plots. The simplest solution is to plot changes in our y variable against changes in our x variables one at a time, while holding the other x variables constant. A good way to do this is to set 2 of your x variables to their median value in the observations, and then look at how y changes with the remaining x variable. We can do this by generating a new data set, then using the broom::augment() function (or predict() in base R) and plotting just like we did earlier:
### Look at bill depth ~ body mass while holding bill length and flipper length constant
# Use expand to get full range of 1 variable, then add in median of other variable(s) from original data
newdata = gentoo %>%
expand(body_mass_g) %>% # Full range of body mass data (gets rid of other variables)
mutate(bill_length_mm = median(gentoo$bill_length_mm, na.rm=TRUE),
flipper_length_mm = median(gentoo$flipper_length_mm, na.rm=TRUE))
lm_gentoo_3_predict = lm_gentoo_3 %>%
augment(newdata = newdata, se_fit = TRUE) %>%
mutate(lwr = .fitted - 1.96 * .se.fit, upr = .fitted + 1.96 * .se.fit) # Calculate 95% C.I. using SE
# Plot the data and the model predictions
ggplot() +
geom_point(aes(x = body_mass_g, y = bill_depth_mm), data=gentoo) + # original data
geom_ribbon(aes(ymin = lwr, ymax = upr, x = body_mass_g), alpha = .15, data=lm_gentoo_3_predict) +
geom_line(data=lm_gentoo_3_predict, aes(y = .fitted, x = body_mass_g), size = 1) +
annotate("text", x=4250, y=17, label= paste0("flipper length = ", median(gentoo$flipper_length_mm, na.rm=TRUE), "mm")) +
annotate("text", x=4250, y=16.5, label= paste0("bill length = ", median(gentoo$bill_length_mm, na.rm=TRUE), "mm"))
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 row(s) containing missing values (geom_path).
Plot the model predictions from our model lm_gentoo_3 so that we can see the variation in bill depth vs. flipper length while holding bill length and body mass constant at their medians.
Analysis of Variance (ANOVA) is just a special case of linear regression where the independent predictor variables are all categorical variables. Mathematically, running an ANOVA is the same as running a linear model. You can build your linear model using lm() as before, and then print the output of the model using the function anova(), which presents the statistical output in a classic ANOVA table. I always like to emphasize how there are multiple ways to accomplish the same goals in programming! Another way to run an ANOVA is to use the aov() function to run an ANOVA model directly.
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
# Conduct an ANOVA using lm()
penguin_lm = lm(body_mass_g ~ species + sex, data=penguins)
summary(penguin_lm)
##
## Call:
## lm(formula = body_mass_g ~ species + sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -816.87 -217.80 -16.87 227.61 882.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3372.39 31.43 107.308 <2e-16 ***
## speciesChinstrap 26.92 46.48 0.579 0.563
## speciesGentoo 1377.86 39.10 35.236 <2e-16 ***
## sexmale 667.56 34.70 19.236 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 316.6 on 329 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8454
## F-statistic: 606.1 on 3 and 329 DF, p-value: < 2.2e-16
anova(penguin_lm)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 145190219 72595110 724.21 < 2.2e-16 ***
## sex 1 37090262 37090262 370.01 < 2.2e-16 ***
## Residuals 329 32979185 100241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Conduct the same ANOVA using aov()
penguin_anova = aov(body_mass_g ~ species + sex, data=penguins)
summary(penguin_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 145190219 72595110 724.2 <2e-16 ***
## sex 1 37090262 37090262 370.0 <2e-16 ***
## Residuals 329 32979185 100241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 11 observations deleted due to missingness
Note that the output of anova(penguin_lm) has all of the same statistics and results as summary(penguin_anova). The output tells us that both species and sex are significant predictors of penguin body mass, but it doesn't tell us more than that. For a predictor with two groups, such as sex, we can find out which group is associated with heavier vs. lighter body mass by simply calculating and comparing the means of the two groups. For a predictor with 3 or more groups, we need to conduct a post hoc Tukey test.
# which sex has higher body mass?
penguins %>%
group_by(sex) %>%
summarize(mean_body_mass_g = mean(body_mass_g))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## sex mean_body_mass_g
## <fct> <dbl>
## 1 female 3862.
## 2 male 4546.
## 3 <NA> NA
# which species has higher body mass?
penguins %>%
group_by(species) %>%
summarize(mean_body_mass_g = mean(body_mass_g, na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## species mean_body_mass_g
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
TukeyHSD(penguin_anova) # Requires the output of the aov() function
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = body_mass_g ~ species + sex, data = penguins)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 26.92385 -82.51532 136.363 0.8313289
## Gentoo-Adelie 1386.27259 1294.21284 1478.332 0.0000000
## Gentoo-Chinstrap 1359.34874 1246.03315 1472.664 0.0000000
##
## $sex
## diff lwr upr p adj
## male-female 667.4577 599.193 735.7224 0
So the post hoc Tukey test tells us that Chinstrap and Adelie penguin body mass is not significantly different (p>0.05), while Chinstrap and Adelie body mass are each significantly different from Gentoo body mass (p<0.05). Combining this with the group means we calculated above, we can say that Adelie and Chinstrap body mass is significantly less than Gentoo body mass. Tada!
ANOVA is pretty robust to not-quite-normal data, but if you have a big normality problem you can switch to the Kruskal-Wallis test kruskal.test().
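For example, a rank-based Kruskal-Wallis version of the body mass by species comparison is a one-liner (a quick sketch; the formula interface drops rows with missing values automatically):

```r
library(palmerpenguins)

# Nonparametric alternative to a one-way ANOVA of body mass across species
kruskal.test(body_mass_g ~ species, data = penguins)
```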
Conduct an Analysis of Variance to determine whether Adelie body mass is significantly different between the three islands where observations were collected. Conduct a post-hoc Tukey test if appropriate.
Overview of basic regression capabilities in R: https://www.statmethods.net/stats/regression.html
Diagnostic tests for linear regression: https://www.statmethods.net/stats/rdiagnostics.html https://www.ianruginski.com/post/regressionassumptions/